What Is Claimed:

1. A method of identifying whether a sequence is a semantic unit, the
method comprising: 

calculating a first value representing a coherence of terms in the 
sequence; 

calculating a second value representing variation of context in which the 
sequence occurs; and 

determining whether the sequence is a semantic unit based at least in part 
on the first and second values. 

2. The method of claim 1, wherein the coherence of the terms in the
sequence is calculated relative to a collection of documents. 

3. The method of claim 2, wherein the coherence of the terms in the 
sequence is calculated as a likelihood ratio that defines a probability of the 
sequence occurring in the collection of documents relative to parts of the 
sequence occurring. 

4. The method of claim 2, wherein the coherence of the terms in the
sequence is calculated as:

$$LR(A,B) = \frac{L(f(B),N)}{L(f(AB),f(A)) \, L(f({\sim}AB),f({\sim}A))}$$

where f(A) defines a number of occurrences of term A in the collection of
documents, f(~A) defines a number of occurrences of a term other than term A
in the collection of documents, f(B) defines a number of occurrences of term B in
the collection of documents, N defines a total number of events in the collection
of documents, f(AB) defines a number of times term A is followed by term B in
the collection of documents, and f(~AB) is a number of times a term other than A
is followed by term B in the collection of documents, wherein

[equation defining L missing from source]
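As an illustration only, and not as part of the claims, the likelihood ratio of claim 4 might be sketched as follows. The sketch rests on several assumptions that the source text does not confirm: that L(k, n) is the binomial likelihood (k/n)^k (1 - k/n)^(n-k), that the numerator is L(f(B), N), that f(~A) = N - f(A), and that f(~AB) = f(B) - f(AB). The function names are hypothetical. The ratio is computed in log space to avoid underflow.

```python
import math


def log_l(k, n):
    """Log of the ASSUMED likelihood L(k, n) = p^k * (1 - p)^(n - k), with p = k / n."""
    if n == 0 or k == 0 or k == n:
        return 0.0  # the likelihood equals 1 in these edge cases (0^0 taken as 1)
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def log_likelihood_ratio(f_a, f_b, f_ab, n):
    """log LR(A, B) under the assumptions stated in the lead-in."""
    f_not_a = n - f_a        # assumed: events whose first term is not A
    f_not_a_b = f_b - f_ab   # assumed: B preceded by a term other than A
    return log_l(f_b, n) - log_l(f_ab, f_a) - log_l(f_not_a_b, f_not_a)
```

Under these assumptions the log ratio is zero when B follows A exactly as often as it follows any other term, and becomes more negative as the association between A and B strengthens.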



5. The method of claim 1, wherein the coherence of the terms in the
sequence is defined as not being sufficient unless a threshold is met.

6. The method of claim 5, wherein the threshold is defined as:

$$f(AB) > \frac{f(A)\,f(B)}{N},$$

where f(A) defines a number of occurrences of term A in the collection of
documents, f(B) defines a number of occurrences of term B in the collection of
documents, N defines a total number of events in the collection of documents,
and f(AB) defines a number of times term A is followed by term B in the
collection of documents.
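As an illustration only, the threshold of claim 6 can be read as requiring that A be followed by B more often than would be expected if the two terms occurred independently. A minimal sketch, assuming the threshold is f(A)·f(B)/N and using a hypothetical function name:

```python
def exceeds_chance(f_a, f_b, f_ab, n):
    """True when f(AB) is strictly greater than the assumed threshold f(A) * f(B) / N."""
    return f_ab > f_a * f_b / n
```

For example, with f(A) = 10, f(B) = 20 and N = 100, the assumed threshold is 2, so a pair observed 10 times passes and a pair observed twice does not.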

7. The method of claim 1, wherein the variation of context in which the
sequence occurs is calculated relative to a collection of documents.

8. The method of claim 7, wherein the variation of context in which the
sequence occurs is calculated as a measure of entropy of the context of the 
sequence. 

9. The method of claim 7, wherein the variation of context in which the
sequence occurs, H(S), is calculated as

H(S) = MIN(HL(S),HR(S)),

$$HL(S) = -\sum_{w} \frac{f(wS)}{f(S)} \log \frac{f(wS)}{f(S)},$$

and

$$HR(S) = -\sum_{w} \frac{f(Sw)}{f(S)} \log \frac{f(Sw)}{f(S)},$$

where MIN defines a minimum operation, S represents the sequence, f(wS)
defines a number of times a particular term, w, appears in the collection of
documents followed by the sequence, f(Sw) refers to a number of times the
sequence is followed by w in the collection of documents, and f(S) refers to a
number of times the sequence S is present in the collection of documents.

10. The method of claim 7, wherein the variation of context in which the
sequence occurs, HM(S), is calculated as

HM(S) = MIN(HLM(S),HRM(S)),

where MIN defines a minimum operation, HLM(S) is defined as a minimum of
1 - f(wS)/f(S) for each term w in the collection of documents, HRM(S) is
defined as a minimum of 1 - f(Sw)/f(S) for each term w in the collection of
documents, f(wS) defines a number of times a particular term, w, appears in
the collection of documents followed by the sequence, f(Sw) refers to a number
of times the sequence is followed by w in the collection of documents, and
f(S) refers to a number of times the sequence is present in the collection of
documents.
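As an illustration only, the entropy-based measure H(S) and the measure HM(S) might be sketched as follows. The sketch assumes that the left and right contexts of the sequence have already been counted into dictionaries mapping each neighboring term w to f(wS) or f(Sw), and that the natural logarithm is used (the source does not state the base). All function and variable names are hypothetical.

```python
import math


def side_entropy(context_counts, f_s):
    """Entropy of one side's contexts: -sum over w of p * log(p), with p = count / f(S)."""
    total = 0.0
    for count in context_counts.values():
        if count > 0:
            p = count / f_s
            total -= p * math.log(p)
    return total


def entropy_variation(left_counts, right_counts, f_s):
    """H(S): the smaller of the left-side and right-side entropies."""
    return min(side_entropy(left_counts, f_s), side_entropy(right_counts, f_s))


def side_min_complement(context_counts, f_s):
    """Minimum over w of 1 - count / f(S), i.e. one minus the share of the most frequent neighbor."""
    return 1.0 - max(context_counts.values(), default=0) / f_s


def max_share_variation(left_counts, right_counts, f_s):
    """HM(S): the smaller of the left-side and right-side minima."""
    return min(side_min_complement(left_counts, f_s),
               side_min_complement(right_counts, f_s))


# Hypothetical counts for a sequence S that occurs 4 times.
left = {"a": 1, "b": 1, "c": 2}   # f(wS) for each left neighbor w
right = {"x": 2, "y": 2}          # f(Sw) for each right neighbor w
```

With these hypothetical counts the right side is the less varied one, so H(S) equals log 2 and HM(S) equals 0.5.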



11. The method of claim 7, wherein the variation of context in which the
sequence occurs, HC(S), is calculated as

HC(S) = MIN(HLC(S),HRC(S)),

where MIN defines a minimum operation, HLC(S) is defined as
$\sum_{w} \delta(wS)$ and HRC(S) is defined as $\sum_{w} \delta(Sw)$, where
$\delta(X)$ is defined as one if sequence X occurs in the collection of
documents and zero otherwise, where wS refers to a particular word followed
by the sequence, and where Sw refers to the sequence followed by a word.
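As an illustration only, HC(S) amounts to counting how many distinct words occur immediately to the left and to the right of the sequence and taking the smaller count. A minimal sketch, assuming the contexts are supplied as dictionaries of neighbor counts (hypothetical names):

```python
def distinct_context_count(context_counts):
    """Number of neighboring words w that occur at least once next to the sequence."""
    return sum(1 for count in context_counts.values() if count > 0)


def count_variation(left_counts, right_counts):
    """HC(S): the smaller of the left and right distinct-neighbor counts."""
    return min(distinct_context_count(left_counts),
               distinct_context_count(right_counts))
```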

12. The method of claim 7, wherein the variation of context in which the
sequence occurs, HP(S), is calculated as

HP(S) = MIN(HLP(S), HRP(S))
where MIN defines a minimum operation, HLP(S) is defined as the number of
continuations to the left of the sequence that cover a predetermined percentage 
of all cases in the collection of documents and HRP(S) is defined as the number 
of continuations to the right of the sequence that cover the predetermined 
percentage of all cases in the collection of documents. 
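As an illustration only, HP(S) might be sketched as follows. The claim does not say how the covering continuations are selected; this sketch assumes the most frequent continuations are taken first and that "all cases" means all observed continuations on that side. The function names and the `coverage` parameter (a fraction between 0 and 1) are hypothetical.

```python
def side_coverage_count(context_counts, coverage):
    """Smallest number of most-frequent continuations covering `coverage` of one side's cases."""
    needed = coverage * sum(context_counts.values())
    covered = 0
    ranked = sorted(context_counts.values(), reverse=True)
    for i, count in enumerate(ranked, start=1):
        covered += count
        if covered >= needed:
            return i
    return len(ranked)


def coverage_variation(left_counts, right_counts, coverage):
    """HP(S): the smaller of the left and right coverage counts."""
    return min(side_coverage_count(left_counts, coverage),
               side_coverage_count(right_counts, coverage))
```

Under these assumptions, a sequence that is almost always followed by the same word yields a small right-side count even if many rare continuations also occur.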

13. The method of claim 1, wherein determining whether the sequence
is a semantic unit includes comparing the first and second values to first and 
second thresholds and identifying the sequence as a semantic unit when the first 
and second values satisfy the first and second thresholds. 
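As an illustration only, the comparison of claim 13 might be sketched as follows. The claim does not state in which direction a value "satisfies" its threshold; this sketch assumes both values are scored so that larger means stronger evidence (for instance, a coherence score that grows with the association between the terms). The function and parameter names are hypothetical.

```python
def is_semantic_unit(coherence, variation, coherence_threshold, variation_threshold):
    """True when both values meet their thresholds (ASSUMED: higher scores are better)."""
    return coherence >= coherence_threshold and variation >= variation_threshold
```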

14. The method of claim 1, wherein the sequence includes three or
more words. 

15. The method of claim 1, further including:
applying one or more rules to the sequence, and 

wherein determining whether the sequence is a semantic unit is further 
based at least in part on the application of the one or more rules. 

16. A device comprising: 

a coherence component configured to calculate a coherence of multiple 
terms in a sequence of terms;

a variation component configured to calculate a variation of context terms
in a collection of documents in which the sequence occurs; and 

a decision component configured to determine whether the sequence 
constitutes a semantic unit based at least in part on results of the coherence 
component and the variation component. 

17. The device of claim 16, wherein the context terms include terms to
the left and right of the sequence. 

18. The device of claim 16, wherein the coherence of the terms in the
sequence is calculated relative to the collection of documents. 

19. The device of claim 18, wherein the coherence of the terms in the
sequence is calculated as a likelihood ratio that defines a probability of the 
sequence occurring in the collection of documents relative to parts of the 
sequence occurring. 

20. The device of claim 16, wherein the variation of context in which the
sequence occurs is calculated as a measure of entropy of the context of the 
sequence. 

21. The device of claim 20, wherein the variation of context in which the
sequence occurs, H(S), is calculated as

H(S) = MIN(HL(S),HR(S)),

$$HL(S) = -\sum_{w} \frac{f(wS)}{f(S)} \log \frac{f(wS)}{f(S)},$$

and

$$HR(S) = -\sum_{w} \frac{f(Sw)}{f(S)} \log \frac{f(Sw)}{f(S)},$$

where MIN defines a minimum operation, S represents the sequence, f(wS) 
defines a number of times a particular term, w, appears in the collection of 
documents followed by the sequence, f(Sw) refers to a number of times the 
sequence is followed by w in the collection of documents, and f(S) refers to a 
number of times the sequence S is present in the collection of documents. 



22. The device of claim 20, wherein the variation of context in which the
sequence occurs, HM(S), is calculated as

HM(S) = MIN(HLM(S),HRM(S)),
where MIN defines a minimum operation, HLM(S) is defined as a minimum of
1 - f(wS)/f(S) for each term w in the collection of documents, HRM(S) is
defined as a minimum of 1 - f(Sw)/f(S) for each term w in the collection of
documents, f(wS) defines a number of times a particular term, w, appears in
the collection of documents followed by the sequence, f(Sw) refers to a number
of times the sequence is followed by w in the collection of documents, and
f(S) refers to a number of times the sequence is present in the collection of
documents.

23. The device of claim 20, wherein the variation of context in which the
sequence occurs, HC(S), is calculated as 

HC(S) = MIN(HLC(S), HRC(S)) , 

where MIN defines a minimum operation, HLC(S) is defined as
$\sum_{w} \delta(wS)$ and HRC(S) is defined as $\sum_{w} \delta(Sw)$, where
$\delta(X)$ is defined as one if sequence X
occurs in the document collection and zero otherwise, where wS refers to a 
particular word followed by the sequence, and where Sw refers to the sequence 
followed by a word. 

24. The device of claim 20, wherein the variation of context in which the 
sequence occurs, HP(S), is calculated as 

HP(S) = MIN(HLP(S), HRP(S)) 
where MIN defines a minimum operation, HLP(S) is defined as the number of 
continuations to the left of the sequence that cover a predetermined percentage 
of all cases in the collection of documents and HRP(S) is defined as the number 
of continuations to the right of the sequence that cover the predetermined 
percentage of all cases in the collection of documents. 

25. The device of claim 16, wherein the decision component is further 
configured to compare the results of the coherence component and the variation 
component to threshold values and identify the sequence as a semantic unit
based at least in part on the comparisons.

26. The device of claim 16, further comprising:

a heuristics component configured to apply one or more predefined rules 
to the sequence, wherein the decision component is further configured to 
determine whether the sequence constitutes a semantic unit based at least in 
part on application of the one or more rules. 

27. The device of claim 26, wherein the one or more rules are 
exclusionary rules that determine when certain sequences are not semantic 
units. 

28. A device comprising: 

means for calculating a first value representing a coherence of terms in a 
sequence of terms; 

means for calculating a second value representing variation of context in 
which the sequence occurs; and 

means for determining whether the sequence is a semantic unit based at 
least in part on the first and second values.

29. A computer-readable medium that includes programming
instructions configured to control at least one processor, the computer-readable 
medium comprising: 

instructions for calculating a first value representing a coherence of terms 
in a sequence of terms; 

instructions for calculating a second value representing variation of 
context in which the sequence occurs; and 

instructions for determining whether the sequence is a semantic unit 
based on the first and second values. 